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ABSTRACT 

Load distribution schemes are presented which minimize the total data traffic in 
the Choiesky factorization of dense and sparse, symmetric, positive definite matrices 
on multiprocessor systems with local and shared memory. The total data traffic in fac- 
toring an n x n sparse, symmetric, positive definite matrix representing an n-vertex 

i + “ 2 

regular 2-D grid graph using n“, a< 1, processors is shown to be 0(n 2 ). It is 0(n 2 ), 

when n a , a > l, processors are used. Under the conditions of uniform load distribution 
these results are shown to be asymptotically optimal. The schemes allow efficient use 
of up to O(n) processors before the total data traffic reaches the maximum value of 

2 

0(n 2 ). The partitioning employed within the scheme, allows a better utilization of the 
data accessed from shared memory than those of previously published methods. 


Research supported by the National Aeronautics and Space Administration under NASA contract No, NAS 1-18 107 while the 
authors were in residence at ICASE, NASA Langley Research Center, Hampton, VA 23665. 
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1. Introduction 

Consider the problem of solving a system of linear equations 

Ax - b 

where A is an n x n sparse, symmetric, positive definite matrix, x is an n x 1 vector of 
variables, and b is an n x l vector of constants on a multiprocessor system with both 
shared and local memory. In this paper the total data traffic involved in factoring the 
matrix A using the column version of Cholesky decomposition is analyzed. The 
analysis is restricted to systems for 2-D regular grid graphs. The results developed 
here, however, can be applied to a wider class of problems. 1 

In [21, George et al. have given a parallel sparse factorization scheme for local 
memory multiprocessor systems that has a total data traffic of 0(n l + ° log 2 n) using n a 
processors. This result was improved to 0{n x +a ) in [4]. In this paper a scheme that 

1 +- 2 . 

has a total data traffic of 0(n 2 ) using n a , a si, processors is presented; with n a , 

2 

a> l, processors the resulting data traffic is 0(n 2 ). Although a multiprocessor system 
with both shared and local memory is assumed as the model of computation, the 
results developed here are equally applicable to systems with only local memories. 

In Section 2, some background work is discussed and the model of computation 
is specified. In Section 3, expressions for the total data traffic required in factoring 
dense matrices are developed. Using these results the total data traffic for sparse sys- 
tems is analyzed in Section 4. Conclusions are given in the final section. 


1 V. K. Naik and M. L. Patrick, Communication requirements of parallel sparse Cholesky factorizations , in preparation. 
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2. Some Preliminaries 

The model of computation used in our analysis is that of a multiprocessor system 
with a two level memory hierarchy such that each processor has local memory and all 
processors have access to a shared memory. The access cost per nonzero element in 
the shared memory is assumed to be unity for any processor. The total number of 
shared memory accesses from the beginning to the end of an algorithm is defined as 
the communication requirement or data traffic of that algorithm implemented on the 
multiprocessor system. 

The n x n matrix A considered here is assumed to represent a system 
corresponding to a 2-D regular grid graph, the vertices of which are ordered using the 
nested dissection method [3]. Figure 1 shows an ordering scheme for a 7 x 7 grid. 
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Figure 1: A 7 x 7 grid with nested dissection ordering. 
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Such an ordering has an optimal sequential operation count and fill-in. 

The basic algebraic scheme used in this paper to factor A is the column version of 
the Cholesky decomposition method [6]. An outline of this algorithm for factoring the 
n x n matrix A is as follows. Let A = LL T ; e A and l SJ e L. 

for j = l until n do 
begin 

Initialize ly = , i=j,-- ji. 

for it = l until j-l do 
for i = j until n do 

hj = hj ~ hji * hk » 
hj ~ 'fljj > 

for k = j+ 1 until n do 

ikj = Uj I b/> 


In the above algorithm for clarity, the values of bj are shown separately from those of 
a t j. In practice l tJ may overwrite a tJ . 

In the next two sections, parallel schemes for the above algorithm are developed 
for the distribution of work among processors of the multiprocessor system and their 
communication requirements are analyzed. The analysis assumes some familiarity 
with graph theory concepts related to matrix representations of systems of equations, in 
particular, the notion of vertices, edges, separators, subgraphs of a graph, and the 
correspondences between the vertices and the rows and columns of the matrix, the 
edges and the nonzero elements, and addition of edges and fill-in during the factoriza- 
tion of A. For details on these concepts see [3] and [7] and the references therein. 

Finally, the standard notations "0", "o'\ and "Ci" are used to characterize the 
asymptotic growth rates of computation and communication requirements. Precise 
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definitions of these notations can be found in elementary textbooks on the analysis of 
algorithms, e.g. |9J. 

3. Communication Requirements of a SPOCC Factorization Scheme for Dense 
Matrices 

First, a scheme based on the column version of the Cholesky decomposition [6] 
for factoring the lower triangular part of an m x m symmetric, positive definite, dense 
matrix using p processors is described. Without loss of generality, further assume that 

P = (r 2 + r) where r is an integer. The lower triangular matrix is partitioned into p 

partitions such that all, except r partitions, are s x 5 square blocks, where s= -y. The 

remaining r partitions which lie on the diagonal of the matrix are s x s triangular 
blocks. Each of these partitions is assigned to a unique processor. Initially, each pro- 
cessor reads the data for its partition from shared memory into its local memory. The 
r processors in charge of the partitions containing the left most s x s blocks of the 
matrix commence the computations of their part of the factorization. As soon as an 
element of the factor is computed, it is written into shared memory for access by other 
processors. As the necessary data becomes available, the remaining processors initiate 
computations on the blocks assigned to them. Each processor accesses from shared 
memory only those elements that are needed for local computation and each element is 
read no more than once by any processor. This parallel factorization scheme is termed 
as the submatrix-partition oriented column Cholesky (SPOCC) factorization scheme. 

Next the communication requirements of the above scheme are analyzed. In the 
following two theorems it shown that the total data traffic involved is 0(m 2 Vp) and that 
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this result is asymptotically optimal. First consider the data traffic associated with a 
generic square block / shown in Figure 2. In that figure, the darkened area represents 
the data elements that are required for the computations in block /. Let block / be 
bounded by columns ( 1 - 1)5 + 1 and is, and by rows (j-l)s + 1 and js where 1 < i < r and 
] <j <,r. The following lemma provides bounds for the communication cost of block /. 

Let a kJ be any element in block /. During the factorization this element undergoes 
a sequence of modifications before it arrives at a final value. The computations lead- 
ing to the final value of this element require the final resultant values of certain other 
elements and its own initial value. For determining the communication requirements 
the actual value, either final or initial, of an element accessed is not important. Only 
an account of this access to the shared memory is necessary. To keep the discussion 
brief, in the analysis of the data traffic, the term appropriate or required value of an 



Figure 2: The data traffic associated with block /. 
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element is used. An appropriate or required value of an element at a particular step 
of the computations is defined to be the value of that element that is required by the 
data dependencies of the algorithm at that step of computation. In the proof given 
below, a,, represents all the elements in row l of the lower triangular part of the matrix 
under consideration. 

LEMMA 1 : A total data traffic of (2 * - + —s is necessary and sufficient for fac- 

toring the square block 1; it is the same for all square blocks bounded by the columns 
(/— l )s + l and i s. The data traffic associated with a s x s triangular diagonal block 

bounded by the columns (i-l)s + l and i s, is ( i - + -jm. 

PROOF: Consider the computations at any element a kJ in block /. For these computa- 
tions the appropriate values of the elements a,, and a kl - for all /'< / are required; i.e., 
the data required are all the elements in row / and all the elements in row k on and to 
the left of column /. Now a kJ is any element in block I and hence, (j-l)s + l < k <js 
and (i-i).f + l < / < i s. Thus, the computations corresponding to any element in block I 
depend on all the elements in some row l of the factor, (/-l)$ + l < / < is, and on the 
elements to the left of column /-j+i in some row k, (J-l)s + l <k < js. Furthermore, no 
other information is needed in performing the factorization at that element. Therefore 
all the information in the above mentioned rows is sufficient for performing the com- 
putations at all the elements of block /. Conversely, the block / contains all the ele- 
ments a kJ such that (j-l)s + l <k< js and (i-l)j + 1 < / < is. The completion of the com- 
putations at these elements requires all the elements in rows / of the factor such that 
(i-l)s + 1 < / < i s and all the elements to the left of column j\s+ 1 in all the rows k such 
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that (/-l)j + l < k < js. Thus, it is necessary as well as sufficient to read the values of 
these specific elements for completing the factorization of the square block /. A count 
of the number of such elements gives a bound on the data traffic involved in factoring 
block /. It should be noted that this count includes the elements in block / which are 
locally computed, but the initial value of each of these elements has to be read prior to 
performing computations at that point and thus the communication cost remains the 
same. 

Thus, the data traffic that is both necessary and sufficient to factor block I is 

= 2(i-l>i 2 + + 

* <a- 

The above bound on the data traffic is independent of j and so it is the same for all the 
square blocks within the columns (i-l)s + 1 and is. 

Similarly, we can show that the data traffic associated with ans triangular diag- 
onal block bounded by the columns (i-l)s + 1 and i s is, (z - + ~s. 

□ 

Using these results we get a bound on the total data traffic as stated in the following 
Theorem. 

THEOREM 1 : The total data traffic involved in the SPOCC factorization of an m x m 
dense matrix using p processors is 0(m 2 Vp). 
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PROOF: The total data traffic associated with all the blocks bounded by columns 
(i- 1 ).v + 1 and is, 1 <, i £ r, is given by, 


(r -i) x data traffic associated with a square block 

+ data traffic associated with a triangular block bounded by the given columns. 

From Lemma 1 and the fact that there are r such column partitions, we get the total 
data traffic involved in factoring the m x m dense matrix using the SPOCC factoriza- 
tion as 


s 

1=1 


(r-0((2i - !)•/ +± s) + «i - + 1-s) 


Ignoring the lower order terms, the total data traffic is 


= £(2ri-2/V 

i=i 


< 



Since p = (r 2 + r) and Szz the total data traffic is Ofm 2 Vp). 


□ 


The above theorem gives an upper bound on the data traffic involved in factoring 
the m x m matrix using the submatrix assignment scheme. In the following theorem 
we prove that this result is optimal in the order of magnitude sense. Before giving the 
proof, we again consider the data dependencies involved in computing an element of 
the factor. Consider the computations at any element a, v . To compute the value of the 
corresponding element in the factor, values at all the elements in row j and the values 
of elements in column 1 through j of row i are needed. Thus, any off-diagonal ele- 
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ment a tJ requires values of 2j elements for completing the computations and any diago- 
nal element (when i = j) requires values of j elements. We now have the following two 
obvious observations: 

(1) The values at all the elements in any row i can be used to complete the computa- 
tions at exactly one element, namely, the diagonal element a iti . 

(2) If i and j are any two rows, such that i > j, then the values of the elements in 
these two rows can be used to complete computations at exactly one off-diagonal 
element a tJ . No other information is needed to complete the computations at that 
element. 

These observations lead to the following lemma on the number of elements at which 
the computations can be completed by acquiring information from exactly k rows. 

LEMMA 2 : If the required values of all the elements in any k rows are available then 

a l 

the computations of at most — + — elements can be completed; conversely, the com- 

a k 

putations corresponding to any — + — elements depend on at least one element from 
each row of at least k distinct rows. 

PROOF: We prove the first part of the lemma by induction on the number of rows. 

It? k 

When k = l, — + — = l, which is true from Observation (1). This gives the basis for 

the inductive proof. Now assume that the result is true for k- 1 rows; i.e., assume that 

(k~\) 2 (k— 1) 

with the values of all elements in *-1 rows computations of at most v ele- 

ments can be completed. In addition to the values of all the elements in the *-1 rows, 
let the values of all the elements in the *th row be known. For the *th row there is a 
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unique diagonal element, whose value can be computed using only the information 
from that row (Observation (1)). From Observation (2), using the information from 
each of the previous k-\ rows and the kth row, computations at one unique new ele- 
ment can be completed. Thus, by knowing the appropriate values at all the elements 
in the *th row, computations of at most k extra elements can be completed; i.e., the 
values of all elements in the k rows can be used to complete computations of at most 
(£-1) k 2 k 

■ ■ ■■■ + —— + k = — + — elements. This completes the inductive proof of the first 

part of the lemma. 

a k 

To prove the second part, assume that there is a set of — + — elements such that 

all the computations at these elements depend on values from less that k rows. Since 
the computations at any element depends on its own initial value, it is obvious that for 

i a k 

the above assertion to hold, these — + — elements must lie on less than k rows. 

2 2 

Assume that k - e rows hold these elements and that information from k-t rows is 
sufficient to complete the computations at these elements, for e > e' > 1. From the first 
part of the lemma, all the information from any it - e rows can be used to complete 

computations of at most — ^ ■■■ + - elements. This implies that there are at least 

e (k - — + — ) > 0 elements in k - e rows where the computations cannot be completed 

using the information in the k-t rows - a contradiction. Therefore information from 

k 2 k 

at least k rows is needed in order to complete the computations at — + — elements. 


□ 
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Next the above described scheme with a total data traffic of Ofm 2 -ip) in factoring 
the m x m matrix using p processors is shown to be optimal in its communication 
requirements. 

THEOREM 2 : Assuming that the computational load is to be uniformly distributed 
among p processors, the data traffic involved in computing the Cholesky factor of an m 
x m dense matrix is Cl(m 2 'Ip). 

PROOF: Consider the computations corresponding to the elements in the set 
S = [djj l i,j > y }. In Figure 3 the gray area FGH denotes this set of elements. The 

3 

total computational work in factoring the m x m matrix is — + o(m 3 ) and that in factor- 

6 

m 3 

ing the part corresponding to area FGH is — + o(m 3 ). This implies that out of p pro- 



Figure 3. 
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cessors, c x p processors must be assigned to complete the factorization of area FGH, for 

3 3 

some constant 0 < ci < 1, to achieve a uniform load distribution of c 2 — + o(— ) work 

P P 


per processor. 


Let n be some partitioning of elements, that partitions the elements in the set 5 in 
c x p subsets such that each subset has the same amount of computational work. It can 
be shown that for any such partitioning n there are constants c, c" > 0, such that the 


number of elements in any subset are in the interval 



Ignoring the lower 


order terms each subset defined by the partitioning IT has c 2 ~ amount of work. Now 

the element a m<m in the set S has the maximum amount of computational work associ- 
ated with it, which is proportional to m since a mm is modified m times. Therefore, the 


' /H 2 

number of elements in any subset must be greater than or equal to c — , for some 


constant c. Similarly, for any element «. (- m +1) » i- e -> an element in the column y + 1, 

the computational work is the minimum over the set S. This minimum computational 

work is proportional to y. Hence the number of elements in any subset defined by n 

2 

must be less than or equal to c" — for some constant c . 

P 

2 

Let ci — , c < Ci < c", be the number of elements in the ith partition defined by n. 


2 

From Lemma 2, computations at any c,~ elements depend on information from at 


least 


F> — PL 1 


= cr—r rows where c- is a constant. Now for any element a,-.- e S, j 

<p 


is greater than — . Hence the values of at least y elements from each of the c\~ 
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rows is required in completing the computation of any subset of elements defined by 

Q . 

n. Thus the data traffic associated with each partition is at least Since there 

£. 

are c x p such partitions, the data traffic involved is at least ci ~n?-Jp. 

Therefore under the condition of uniform load distribution, the data traffic 
involved in computing the Cholesky factor of an m x m dense matrix using p proces- 
sors is Q(m 2 ^lp). 

□ 


It should be noted that, in an order of magnitude sense, the computational work 
associated with each block in the SPOCC factorization scheme presented earlier is the 
same as that of a partition under the uniform load distribution scheme. Thus the data 
traffic associated with that scheme is optimal in an order of magnitude sense. 

4. Communication Requirements of a SPOCC Factorization Scheme for Sparse 
Matrices 

In this section the total data traffic of a SPOCC factorization scheme in factoring 
a sparse, symmetric, positive definite matrix A corresponding to a regular n-vertex 2-D 
grid graph is analyzed. A 9-point stencil, unless otherwise stated, is used. In the fol- 
lowing discussion L denotes the lower triangular Cholesky factor of the matrix A. 

2 

First, the worst case communication requirement complexity result of 0(n 2 ), which is 
independent of the number of processors used, is obtained and then a scheme that 

l + -2- 

reduces the communication requirement complexity to 0(n 2 ), when n a , a<l. 
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processors are used, is presented. The result is shown to be optimal under the condi- 
tion of uniform load distribution. 

Clearly, the communication requirement is the worst when the use of local 

memory is not allowed. Now consider the computations involved in computing a 

/-i 

nonzero element ly e L. Recall that the expression ay - £ lyljjt is computed first, fol- 

lowed by a single division by l tJ . Thus, for each multiplication there is one subtraction 
operation, at most one division and three memory references and a constant overhead 
such as index computation. Let the values of all the elements of the lower triangular 
part of matrix A and those of L be stored in the shared memory. Any number of pro- 
cessors are allowed to participate in computing a nonzero element of the factor pro- 
vided that no single operation is performed by more than one processor. The follow- 
ing theorem gives the bound on the total data traffic under these assumptions. 
Although the result is obvious, it is useful because it is independent of the number of 
processors used and gives the worst case bound on the data traffic even for the models 
of computation that are more restrictive. 

THEOREM 3 : The worst case data traffic associated with factoring the matrix A is 
2 

0(n 2 ). 

PROOF: For the assumed model of computation, each multiplication operation requires 
at most a constant number of memory references. Thus, the total data traffic is 
< const ■ number of multiplication operations. 

From fl], the number of multiplication operations associated with factoring matrix A is 

2 2 
0(n 2 ). Hence, the total worst case data traffic is 0(n 2 ). 
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□ 


Returning to the model of computation used to get bounds for the dense matrix 
case, the above worst case bound on the data traffic is improved when the number of 
processors is n“ where a < 1. This is possible because now the local memory is used 
to store the data that is needed in more than one computations. 

First, bounds are obtained on the data traffic associated with computing the ele- 
ments of the Cholesky factor along the columns corresponding to a generic "+" shaped 
separator in the nested dissection ordering of a 2-D regular grid graph. Let the order- 
ing scheme be such that the vertices on vertical part of a "+" separator are ordered 
after those on the horizontal part. Let each of the two horizontal parts of the "+" 
separator be referred to as the horizontal sub-separator and the single vertical part be 
referred to as the vertical sub-separator. 

Let Ttf = { k I it <j and 1^*0, /,•* e L ); i.e., ttf is the set of all columns of the factor 

L to the left of the column j+l such that the elements in row i of these columns are 
* 

nonzero. Let Tft* = i.e., Ttf* is the set of all the columns to left of column j+ 1 

x=i 

such that on each of these columns there is a nonzero element in at least one of the i 
through k rows of the factor. Let r represent any m-vertex sub- separator. It is assumed 
that all the vertices in any sub-separator are ordered consecutively. Let low(T) and 
highly) be the indices of the lowest and the highest ordered vertices, respectively, on 
the sub-separator r. Note that high(T) - low(r) + 1 = m. The following lemma estab- 
lishes some basic sub-separator related properties that are useful in analyzing the com- 


munication requirements. 
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LEMMA 3 : Let r be any m-vertex sub-separator, (i) Corresponding to the vertices of 
l there is a dense mx m triangular diagonal block in the Cholesky factor, (ii) In the 
factor L, the columns low(F) through high(T) contain at most four off-diagonal rectangu- 
lar blocks with nonzero elements. Each of these blocks is of size at most (ci-m+c^) x m 
where c, < 2 and c 2 < 3 are integer constants. Any nonzero element in these columns is 
in one of these four blocks or in the diagonal triangular block and no where else. 

PROOF: The first part of the lemma is obvious. In Figure 4, the sub-separator r 
separates the vertices in regions R { and R 2 . Since the vertices in these two regions are 
ordered ahead of those of r, the fill-in due to the elimination of vertices in regions Ri 
and R 2 ensures a dense mxm triangular diagonal block bounded by columns lowQT) and 
high(T) as shown in Figure 5. 


r 3 



n 

Ri 

r* 

r 4 

i 

r 2 

r 2 


Figure 4: Sub-separator r with four surrounding sub-separators. 
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To prove the second part of the lemma, consider Figure 4 again. In that figure 
the thickness of the lines qualitatively indicate the separator levels in the nested dissec- 
tion ordering. Let r,, r 2 , r 3 , and r 4 be the four partial sub-separators that surround the 
sub- separator r. Because of the nature of the nested dissection ordering the vertices of 
r are "connected" to only those higher ordered vertices that lie on r l5 r 2 , r 3 , and r 4 
and to no other vertices.t Thus, all the nonzeros on columns low(r) through high(T) in 
rows below high(T) are confined to only the rows corresponding to the vertices on n, 
r 2 , r 3 , and r 4 . Furthermore, each vertex in r is "connected" to every vertex on these 
four partial sub-separators and hence, the four rectangular blocks are dense. This is 
shown schematically in Figure 5. It can be verified that if r is a horizontal m- vertex 



Figure 5: Off-diagonal blocks with nonzeros corresponding to sub-separator r. 


t Vertex u is said to be "connected" to vertex v, if there exists a path [«, u ]v Uj, • ■ • , u k% v] of length one or more in the 
grid graph such that index (Uy) < min (index(u\index(v)) t for 1 £ r £ k\ i.e» if l S j * 0, g L where i = index(u), j * indexiv). 
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sub-separator, then the surrounding box of vertices is of dimension (2m+3) x (m+2). 
Therefore there are two rectangular off-diagonal dense blocks of dimension at most 
(2m+3) x m and the other two of dimension at most (m+2) x m. Similarly, if r is a 
vertical m-vertex sub-separator, there are four off-diagonal rectangular blocks of 
dimension at most (m+2) x m in the factor. If r is not surrounded on all four sides 
then some these blocks will be missing. 

□ 


Using the above stated sub-separator properties, the following lemma gives the 
data dependencies of the nonzero elements in any of the five blocks specified in 
Lemma 3. The lemma shows that the number of nonzero elements in any row i of the 
factor L is less than c m where c is an integer constant and m is the size of the sub- 
separator to which the vertex corresponding to row i belongs. It is then shown that for 
any row the computations at all the elements ly e L, lowiT) <jz highfJT), for some m- 
vertex sub-separator r require a total of less than c m nonzero elements from that row. 
Note that this count is independent of the sub-separator to which the vertex 
corresponding to row i belongs. Thus, the computations at all the elements in a row of 
any of the five blocks specified in Lemma 3, require only c m elements from that row 
irrespective of the relative location of the off-diagonal blocks in the factor. 

LEMMA 4 : Let r be any m-vertex sub-separator. The nonzero elements from row i, 
i > low(T), required in completing the computations of all elements l tJ e L such that 
loMD < j < high{V), are those elements in row i on the columns in the set given by, 
p) T\?‘ h<r> . For all i > low(r), the number of such useful nonzero elements 
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in row i is < c m for some constant c. 

PROOF: Any nonzero element /, v e L, iz low(T) and lowQT) Zj<, high(T ), is in one of the 
five blocks specified in Lemma 3. Hence to prove the result of this lemma only the 
rows that intersect one of these blocks need to be considered. The result for 
/o*v(r) < i < high(T) is first proved followed by that for i > high(T). 

When low(D <, i < high(r), n T l.^* <n = since, 

q*‘** ( ° c . By definition, the set nf“* A<r) contains all the columns that have a 

nonzero element in row i. Clearly, the nonzero elements from the row i required in 
completing the computations at all the elements / v e L, low(T) <j<, high(T), are on 
columns in the set q*‘** (n . 

To measure the size of the set qf**^ note that it is bounded by the number of 
vertices ordered ahead of the vertex i and which are "connected" to vertex i. Using the 
recursive nature of the nested dissection ordering it can be verified that in the case of 
constant degree grid graphs and when low(JT) £ i £ high{T), the size of the set i s 

bounded by c m, where c is a constant dependent on the degree of the graph and on 
whether r is a horizontal or vertical sub-separator. If r is a horizontal m-vertex sub- 
separator, then for a 5-point stencil, c = 7 and for a 9-point stencil, c = 11. When r is 
a vertical m-vertex sub-separator, the values of c are 5 and 7, respectively. This com- 
pletes the proof when low(V) < / < high(V). 

The case where / > high(T) is considered next. As shown above, || || 

depends on the size of the sub-separator to which the vertex i belongs and hence when 
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i > high(T), || n!“*' ,<n II can be much greater than 0(m ) where m is size of r.f However, 
when the computation of only those elements in row i that lie on columns low(V) 
through high(r) are of concern, each of these computations consists of a product of a 
nonzero element in row i and a nonzero element in one of the rows low(T) through 
high(D in the column higher) or in some other column to the left of it. Thus, for these 
computations only the columns that have a nonzero element in row i and in row j 
where low(r) <,j< high( D, are of interest. Now the set consists of all the 

columns that have a nonzero element in at least one of rows low(T) through highijr). 
Similarly, consists of all the columns that have a nonzero element in row i. 

Clearly, the set Pi consists of all the columns which contain all the 

pairs of nonzero elements that can be usefully employed in completing the computa- 
tions at all the elements l tJ , W(0 <j <, higher). Thus, the nonzero elements from row i, 
i > high(r), required in completing computations at all the elements e L such that 
low(r) <j< high(T), are those elements in the row i on the columns in the set given by, 
P n^ A(r) . 

To get a bound on the size of P T l^** <n » consider the m-vertex hor- 

izontal sub-separator r shown in Figure 4. It is surrounded by sub-separators H, r 2 , 
r 3 , and r 4 . Suppose that /< 9 w(r,) < / < highfTi). Now, the set p con- 

sists of columns corresponding to vertices on r or corresponding to those vertices 
ordered ahead of them which are "connected" to at least one vertex in r and to the 
vertex corresponding to row /. Using the recursive ordering of the nested dissection 


f || U || denotes the cardinality or the number of dements in set U. 
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scheme it can be shown that the number of such vertices is less than 7m. Thus, 
II n II * lm > for low ( r i) * i z Wi). The same bound is obtained 

when low(V 4 ) < i < high( r 4 ). If low(r 2) < i < high(r£ or /<w(r 3 ) < i < high(T 3 ) then it can be 
verified that, || Ti?'** (n II ^ 3m. If r is vertical sub-separator the two 

bounds are 5m and —m respectively. 

□ 


The scheme for completing the computations in all five dense blocks associated 
with an m-vertex sub-separator r using p processors is now described. If p = 1, that 
processor accesses the necessary values and completes all the computations. If p > 1, 
the factorization corresponding to the m x m triangular diagonal block is first com- 
pleted using all the processors and then the factorization corresponding to the four 
off-diagonal blocks. For the first part, as stated in the previous section for the dense 

7 

matrices, the m x m dense diagonal block is partitioned into — - square blocks and 

2 

r diagonal triangular blocks each of size x -y- where p = ~ and each of these 

P partitions is assigned to a unique processor. Each processor completes the computa- 
tions corresponding to its partition by accessing the required data from the shared 
memory. For the purpose of factoring, the off-diagonal blocks are treated as if they 
were adjacent, and the resultant rectangular block is partitioned into p sub-blocks each 


of size 


cm 

'Jp 


x —r, where c < 6 for a horizontal sub-separator and c < 4 for a vertical 


sub- separator. Again each partition is assigned to a unique processor. 
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Using the results from the previous lemma, a bound is obtained on the data traffic 
associated with the computations performed as outlined in the above scheme. 


LEMMA 5 : The total data traffic associated with completing the computations, using 
p processors, in all the dense blocks within the columns low(T) through high(T), for 
some m-vertex horizontal sub-separator r, is bounded by (53 + llV2)m 2 -Vp. It is 
bounded by (28 + %Ti)m 2 <p for an m-vertex vertical sub-separator. 

PROOF: Let r be an m-vertex horizontal sub-separator. Assume that this sub-separator 
is surrounded on all four sides by vertices ordered higher than any vertex on the sub- 
separator. Such a sub-separator has the worst case communication requirements among 
all the m-vertex sub-separators. 

First, compute the data traffic associated with factoring the triangular diagonal 

r 2 r 

block using p = — + — processors. Each of the sub-blocks requires nonzero elements 

9m 

from at most — rows out of the m rows in the range fow(r) through high(T) of the fac- 
tor. No other information is needed. From Lemma 4, each of these rows has at most 
llm nonzeros. Thus the communication requirement of each partition is at most 

l im-2-^r and the total communication requirement of the triangular block is, bounded 

V 2p 

by lW2mH?. 


Now consider the data traffic associated with the off-diagonal blocks. Each parti- 


tion is of size -^p x -p. Thus, each partition requires nonzero elements from 


rows which are below the row high{T) in the factor. From Lemma 4, each of these 
rows has at most 7m nonzeros that are useful in completing the computations in any of 
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the partitions. Each partition also requires information from rows from the region 

w 

low(V) through high(T). Each of these rows has at most 11m nonzeros. Thus, the com- 
munication requirement of each partition is at most 7 m~ + = ^2- and the 

Vp Vp Vp 

total communication requirement of completing the computations at the off-diagonal 
blocks using p processors is less than or equal to 53mHp. 

Adding the communication costs corresponding to the diagonal and the off- 
diagonal blocks we get the total data traffic associated with r to be less than or equal 
to (53 + 1 lV^m 2 Vp. 

A similar analysis can be used to compute the data traffic when r is an m-vertex 
vertical sub-separator and can be shown to be bounded by (28 + 

□ 


Applying the results from the above lemma, a bound is obtained on the total data 
traffic in factoring the sparse matrix A associated with the 2-D grid graph. First a 
description of the overall load distribution scheme among the processors and some 
notation are given. For the sake of simplicity, assume that the w-vertex 2-D grid graph 

is a n 2 x n 2 square grid and that there are n a processors, as 1. The vertices of the 
grid are ordered using the nested dissection ordering scheme such that the vertices in 
each square sub-grid are consecutively ordered. In Figure 1, such an ordering for a 7 
x 7 grid is shown. This particular ordering is chosen only for the sake of describing 
the scheme. The results presented here are applicable to other nested dissection order- 

1 a 1 a 

ing schemes as well. This ordering results in n“ sub-grids each of size n 2 2 x n 2 2 . 
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The computations at all the nonzero elements corresponding to the vertices of each 
such sub-grid is assigned to a unique processor. After completing the work, two pro- 
cessors from adjacent sub-grids combine to complete the work corresponding to the 

1 « 

n 2 2 -vertex horizontal sub-separator that separates the two sub-grids. This is followed 

1 a 

by the work on the 2 n 2 2 -vertex vertical sub-separator. Four processors work on each 
such vertical sub- separator. This process is continued until the factorization of the 
entire matrix is completed. Whenever more than one processor is working on any 
sub-separator, the scheme presented earlier for a generic m-vertex sub-separator is 
applied. 

Let x* (m. p, k) represent the data traffic using p processors in completing the com- 
putations at all the nonzero elements /, v € L in the columns corresponding to an m- 
vertex horizontal sub-separator that is surrounded by higher ordered vertices on k sides. 
Let x v (m, p, k) represent the same for an m-vertex vertical sub-separator. From Lemma 
5, x h (m,p, 4) < (53 + lW2)/n 2 Vp and x v (m, p, 4) S (28 + 8 'l2)nt l -'lp. Let x t (m,p,k) 
represent the total data traffic, using p processors, in completing the computations 
corresponding to all the sub-separators within an m-vertex sub-grid that is surrounded 
by higher ordered vertices on k sides. 

The following theorem gives an upper bound on the total data traffic in factoring 
the matrix A associated with an n-vertex 2-D regular grid graph using n a processors 
with the scheduling scheme as described above. 

THEOREM 4 : The total data traffic in factoring the n x n sparse matrix A using n a 

I+-2. 1+^- 

processors is 0(n 2 ); i.e., x f («, n a , 0) = 0(n l ). 
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i 1 1 

PROOF: On a n 2 x n 2 regular grid there is an n 2 -vertex vertical sub-separator and 
, i 

two —n 2 -vertex horizontal sub-separators (ignoring the additive constant -1). The 

vertical sub-separator is not surrounded by any vertices that are ordered after the ver- 
tices on the vertical sub-separator. Each of the two horizontal sub- separators are sur- 
rounded on one side. These three sub-separators subdivide the n-vertex grid graph into 

i I i i 

four sub-grids of size —n 2 x — n 2 , each surrounded on two sides by higher ordered 


vertices. Thus, the total data traffic in factoring the corresponding matrix A is given 


by. 


x* («.«“, 0) = x, (n 2 , n a , 0) + 2T*(-in 2 ,|n a ,l) + 4x, ( |/», jn a , 2). 


A recursive expansion of the above expression contains data traffic terms for vert- 
ical sub-separators of different sizes that are surrounded on zero sides, two sides, three 
sides (in two different ways), and on all four sides by higher ordered vertices. It also 
contains data traffic expressions for horizontal sub-separators of different sizes sur- 
rounded in five different ways. To keep the analysis simple, it is assumed that all the 

four sub-grids of size -^n 2 x --n 2 are surrounded on all four sides. This 


simplification results in a conservative expression for the data traffic, but affects only 
the constant terms in the bound. Thus, 

1 , — i i i 

x g (n, n a , 0) < x v (n 2 ,n a , 0) + 2x* ( -jn 2 , — n“, 1) + 4x g ( —n, —n a , 4). 


Now, 


(•7 n -T na ’ 4 ) = x v ( ~n 2 , -j« a , 4) + 2x*(-j'» 2 .-5'' ,a ’ 4 ) + 4t < ( ■j^"* 4) ‘ 



V 
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From Lemma 5, it follows that, 


x, ( |n. |n“, 4) £ -i- (134 + 85V2)n' + 2 . 


16 


An analysis similar to that given in Lemma 5 yields 


and 


x„ (« 2 , 0) S 8V2-n 2 . 


X* ( \n\ |n°, 1) S |(22 + 2S<l)n + 2 . 


Thus, we get, 


1 /- 1+ T 

X g (n,n a , 0) 5 -L< 78 + 71l/ 2)-« 


□ 


In the following theorem the communication bound of the above presented 
scheme is shown to be optimal in an order of magnitude sense, by giving a lower 
bound proof. 

THEOREM 5 : Under the condition of uniform load distribution, the data traffic in 

factoring the n x n sparse matrix A, using n a processors, a < l, is Q(n 2 ). 

PROOF: For a regular 2-D grid graph with n vertices, the separator size for nested 

dissection ordering is n 2 \7\. From Lemma 3, it follows that the factor L has an 
\_ j_ 

n 2 x n 2 dense triangular diagonal block incorporated in it. From Theorem 2, the data 
traffic involved in completing the computations associated with the elements of this 
dense triangular block, under the condition of uniform load distribution using n a pro- 
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i + 

cessors, is Q(n 2 ). Since the factorization of A cannot be completed without complet- 
ing the factorization of this dense block, the result follows. 

□ 


From Theorems 4 and 5, it is clear that the load assignment scheme described 
here for factoring the n \ n sparse matrix using n“ processors is optimal in an order of 
magnitude sense. Note that when n a , a > 1 , processors are used, the data traffic bound 
given in Theorem 3 holds. 

5. Conclusions 

In this paper load distribution schemes for multiprocessor systems with local and 
shared memory are presented for factoring dense and sparse symmetric, positive 
definite matrices. The communication requirements of these schemes are analyzed and 
are shown to be asymptotically optimal for systems corresponding to regular 2-D prob- 
lems where the vertices of the grid graph are ordered using the nested dissection order- 
ing methods. For these schemes it is shown that the total data traffic in factoring an 

n x n dense, symmetric, positive definite matrix is 0{n 2 ) when n a processors are 

used. The total data traffic in factoring the n x n sparse, symmetric, positive definite 

i + -2. 

matrix corresponding to a 2-D grid problem, is shown to be 0(n 2 ) when n“, a£ 1, 

2 

processors are used. When n a , a > l, processors are used the total data traffic is 0(n 2 ). 
Under the condition of uniform load distribution, these results are shown to be optimal 


in an order of magnitude sense. 
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In [4], George et al. have given a load assignment scheme for factoring the 
matrix A that has a total data traffic of 0(n 1+ “) when 0(n a ) processors are used. In this 
assignment scheme, the computational work corresponding to an entire column of the 
matrix A is performed by no more than one processor. In other words the work associ- 
ated with any column is treated as indivisible. Under the condition of uniform load 
distribution and a column-level indivisibility, in [5] and [8] it is shown that the total 
data traffic of that scheme is asymptotically optimal. The scheme presented in this 

1 +t 

paper reduces the communication requirement to 0(n 2 ) by removing the constraint 

of column-level indivisibility. Here the indivisible work unit is the computation 
corresponding to a nonzero element in the factor. Note that this does not necessarily 
enforce the use of a finer grain computational model. However, the SPOCC factoriza- 
tion scheme allows an efficient use of up to 0(n) processors before the total data traffic 

2 1 
reaches the maximum value of 0(n 2 ), whereas in the former scheme only up to 0(n 2 ) 

processors may be efficiently used before the total data traffic reaches the maximum. 

Another notable aspect of the scheme presented in this paper is that the reduction 
in the communication requirements is brought about by improving the utilization of the 
data accessed from shared memory by each processor. Consider the factorization of an 
m x m dense matrix. Let the data utilization of a data element accessed by a proces- 
sor be defined as the number of computations in which that element is used by that 
processor divided by m. Since an element in the factor is needed in at most m compu- 
tations, the maximum utilization of any data accessed is one. Let the aggregate data 
utilization for a processor be defined as the average utilization of the individual data 



29 


elements accessed by that processor. In the SPOCC factorization of an m x m dense 

2^,2 

matrix, each processor accesses at most — j=~ elements from the shared memory and 


each element is used in at least computations. Thus, the utilization of each data 


accessed is and so is the aggregate utilization of all the data accesses. On the 


other hand, with the column-level work assignment scheme, each processor accesses 

0(m 2 ) elements from the shared memory. Of these, only O(-) elements have an utili- 

P 

zation of one and the data utilization for the remaining elements is ~ which gives an 

aggregate data utilization of approximately — . Similar improvements in data utiliza- 

P 

tions are obtained in factoring a sparse matrix. 

It should be noted that the square shape of the submatrix partitions produce the 
best possible aggregate utilizations. For the algorithm considered here, the data depen- 
dencies are such that rectangular and square partitions give rise to high data utiliza- 
tions. Since the square partitions have the minimum perimeter for a given area, the 
number of data elements accessed (which is proportional to the perimeter of the parti- 
tion) for a given work load (which is proportional to the area enclosed), is also a 
minimum for the square partitions. 


An effect of the improvement in the aggregate utilization of data and the resulting 
reduction in the communication requirements is the segregation of the accesses to the 
shared data. Since the total data traffic in factoring an m x m dense matrix using p 


processors is 0(n? ■ ip), on an average each processor accesses only <9(-^=-) data. Note 
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that the total shared data is 0(m 2 ). Thus, on an average each element in the shared 
memory is accessed by O(Vp) processors. The column-level assignment scheme, how- 
ever, has a total data traffic of 0(n^ ■ p) and thus, on an average each processor 
accesses 0(n?) data or on an average each element in the shared memory is accessed 
by 0(p) processors. An obvious implication of this observation is that for the scheme 
presented here, not only is the total data traffic reduced but also the requests at indivi- 
dual shared addresses. This can have considerable impact on the performance of the 
systems with a large number of processors. 

In the factorization scheme presented here, the work assigned to each processor is 
balanced in an order of magnitude sense which is sufficient for carrying out the 
analysis. For practical purposes it is necessary to consider the effect of the constants 
involved on the load imbalance. Schemes for reducing the load imbalance caused by 
the constant terms in the column-level assignment scheme and in the SPOCC scheme 

are presented elsewhere. 2 Some implementation aspects are also considered there. 
Finally, the scheme developed here for factoring a sparse matrix is applicable to a 
wider class of problems. The the communication requirements for these problems are 

found to be optimal as well. 2 

In summary, we have presented schemes that reduce the data traffic involved in 
factoring both dense and sparse, symmetric, positive definite matrices on multiproces- 
sor systems with a two level memory hierarchy. These reductions are made possible 
by improving the utilization of each data element accessed. 


2 V. K. Naik and M. L. Patrick, Communication requirements of parallel sparse Cholesky factorizations f in preparation. 
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